Towards Grounded Spatio-Temporal Reasoning